fix(core): scope graph exports to selected snapshot pages#104
Open
Noctivoro wants to merge 1 commit into
Summary
Thank you for maintaining Crawlith — I ran into a small snapshot/export edge case while using Crawlith for SEO internal-link graphing and opened #103 with the reproduction details.
This PR scopes page loading for non-`single` snapshots to pages whose `last_seen_snapshot_id` matches the selected snapshot. That keeps graph exports aligned with the current crawl's normalization policy instead of including pages that were only seen in older snapshots.
Fixes #103.
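To illustrate the scoping rule, here is a minimal in-memory sketch. It is not the repository code: `PageRow`, `scopeToSnapshot`, and the sample URLs are hypothetical names introduced for illustration, and only `last_seen_snapshot_id` comes from this PR. The actual change is expressed as a SQL predicate in the page repository.

```typescript
// Hypothetical row shape for illustration; only last_seen_snapshot_id
// is a name taken from the PR description.
interface PageRow {
  url: string;
  last_seen_snapshot_id: number;
}

// Keep only pages whose last_seen_snapshot_id equals the selected snapshot.
// This mirrors the intent of `p.last_seen_snapshot_id = ?` in memory.
function scopeToSnapshot(pages: PageRow[], snapshotId: number): PageRow[] {
  return pages.filter((p) => p.last_seen_snapshot_id === snapshotId);
}

// Snapshot 1: crawled with query strings preserved.
// Snapshot 2: crawled with --no-query, so the query URL was not seen again.
const pages: PageRow[] = [
  { url: "https://example.com/a?ref=1", last_seen_snapshot_id: 1 },
  { url: "https://example.com/a", last_seen_snapshot_id: 2 },
  { url: "https://example.com/b", last_seen_snapshot_id: 2 },
];

const scoped = scopeToSnapshot(pages, 2);
console.log(scoped.map((p) => p.url));
```

With the sample rows, the page that was last seen in snapshot 1 is dropped, so the stale query URL no longer reaches the export for snapshot 2.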
Problem
If a site is first crawled with query strings preserved and later crawled with `--no-query`, the latest export can still include older query-URL nodes.
The crawler normalizes new discoveries correctly, but `loadGraphFromSnapshot()` ultimately relies on page repository snapshot queries that include pages first seen in older snapshots. That makes `--no-query` appear ineffective in exported graph nodes.
Changes
- Scope `getPagesBySnapshot()` to `p.last_seen_snapshot_id = ?` for non-`single` snapshots.
- Scope `getPagesIteratorBySnapshot()` the same way for graph loading/export.
- Scope `getPagesIdentityBySnapshot()` to the current snapshot so edge materialization uses the selected snapshot's page set.
- Preserve `single` snapshot behavior, which is still metrics-scoped.
- … `@crawlith/core`.
Validation
Ran locally:
    pnpm run lint
    pnpm test
    pnpm build
Also verified manually against a real crawl workflow:
- Re-crawled with `--no-query` without cleaning the DB.
- Confirmed `queryUrlNodes: 0` in the exported `graph.json`.
Thanks again. Happy to adjust the query semantics if you'd prefer a different snapshot-scoping approach.
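For anyone who wants to repeat the manual check, the count could be computed along these lines. This is a sketch under assumptions: the `GraphExport` shape (a `nodes` array with a `url` field) is hypothetical and may not match the real `graph.json` schema, and `hasQuery` and `countQueryUrlNodes` are illustrative helpers, not Crawlith APIs.

```typescript
// Hypothetical export shape; the real graph.json schema may differ.
interface GraphNode {
  url: string;
}
interface GraphExport {
  nodes: GraphNode[];
}

// True when the URL carries a query string, i.e. a "?" before any "#fragment".
function hasQuery(url: string): boolean {
  const hashAt = url.indexOf("#");
  const withoutFragment = hashAt === -1 ? url : url.slice(0, hashAt);
  return withoutFragment.includes("?");
}

// Count the nodes whose URL still has a query string.
function countQueryUrlNodes(graph: GraphExport): number {
  return graph.nodes.filter((n) => hasQuery(n.url)).length;
}

// Export that still contains a node from an older, query-preserving crawl.
const staleExport: GraphExport = {
  nodes: [{ url: "https://example.com/a?ref=1" }, { url: "https://example.com/a" }],
};

// Export scoped to the latest --no-query crawl.
const scopedExport: GraphExport = {
  nodes: [{ url: "https://example.com/a" }, { url: "https://example.com/b#top" }],
};

console.log({ queryUrlNodes: countQueryUrlNodes(scopedExport) });
```

In practice the export would be read from disk and parsed with `JSON.parse` before being passed to `countQueryUrlNodes`; the inline objects just keep the sketch self-contained.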